Feature/remove large in clause in assets with cte and join#62114

Open
Nataneljpwd wants to merge 18 commits into apache:main from Nataneljpwd:feature/remove-large-in-clause-in-assets

@Nataneljpwd
Contributor


Closes: #61453
This PR resolves the large IN clause by using a CTE with a join rather than batching.
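A minimal sketch of the SQL shape described above, using the standard-library `sqlite3` module rather than Airflow's SQLAlchemy models. The table layout is a simplified stand-in chosen for illustration, not the actual schema or the code in this PR. It contrasts sending (name, uri) pairs back as bind parameters with keeping the lookup in the database through a CTE and a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified tables; only the columns needed for the comparison.
conn.executescript(
    """
    CREATE TABLE asset (id INTEGER PRIMARY KEY, name TEXT, uri TEXT);
    CREATE TABLE dag_schedule_asset_reference (asset_id INTEGER);
    CREATE TABLE asset_active (name TEXT, uri TEXT);
    INSERT INTO asset VALUES (1, 'a', 's3://a'), (2, 'b', 's3://b'), (3, 'c', 's3://c');
    INSERT INTO dag_schedule_asset_reference VALUES (1), (3);
    INSERT INTO asset_active VALUES ('a', 's3://a'), ('b', 's3://b'), ('c', 's3://c');
    """
)

# Approach 1: materialize the rows in Python, then bind every pair into a tuple IN.
# The number of bind parameters grows with the number of assets.
referenced = conn.execute(
    "SELECT a.name, a.uri FROM asset a "
    "JOIN dag_schedule_asset_reference r ON r.asset_id = a.id"
).fetchall()
placeholders = ", ".join("(?, ?)" for _ in referenced)
in_sql = (
    "SELECT name, uri FROM asset_active "
    f"WHERE (name, uri) IN (VALUES {placeholders}) ORDER BY name"
)
via_in = conn.execute(in_sql, [value for row in referenced for value in row]).fetchall()

# Approach 2: express the same set as a CTE and join against it.
# No rows leave the database and the statement has no bind parameters.
cte_sql = """
WITH referenced AS (
    SELECT a.name, a.uri
    FROM asset a
    JOIN dag_schedule_asset_reference r ON r.asset_id = a.id
)
SELECT aa.name, aa.uri
FROM asset_active aa
JOIN referenced ON aa.name = referenced.name AND aa.uri = referenced.uri
ORDER BY aa.name
"""
via_cte = conn.execute(cte_sql).fetchall()
```

Both queries return the same rows; the difference is only in how the set of (name, uri) pairs reaches the database.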

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)
  • No

  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Feb 18, 2026
@Nataneljpwd Nataneljpwd force-pushed the feature/remove-large-in-clause-in-assets branch from e2000b8 to 9110b93 Compare February 18, 2026 11:41
@Nataneljpwd Nataneljpwd marked this pull request as draft February 18, 2026 20:11
@Nataneljpwd Nataneljpwd marked this pull request as ready for review February 24, 2026 07:16
Natanel Rudyuklakir added 2 commits February 24, 2026 21:47
@Nataneljpwd Nataneljpwd force-pushed the feature/remove-large-in-clause-in-assets branch from 2192078 to ebc08dd Compare February 24, 2026 20:20

@Asquator Asquator left a comment

Nice improvement. IN clauses should be avoided whenever the values already reside in the DB.

Comment on lines +3025 to -3037
select(AssetModel)
    .outerjoin(DagScheduleAssetReference)
    .outerjoin(TaskOutletAssetReference)
    .outerjoin(TaskInletAssetReference)
    .group_by(AssetModel.id)
    .order_by(orphaned)

Did you consider extracting this to a helper function in asset.py, like many others located there?

Contributor Author

I did not see a benefit in doing so, which is why I left it inline. Do you have a specific reason for the request? I might have missed something.

Comment on lines +3052 to +3054
active_assets_query = select(AssetActive.name, AssetActive.uri).join(
assets_query,
and_(AssetActive.name == assets_query.c.name, AssetActive.uri == assets_query.c.uri),

Can this be a helper function in asset.py too?
Just to avoid adding even more logic into the scheduler.

Contributor Author

Same as above.

and_(AssetActive.name == assets_query.c.name, AssetActive.uri == assets_query.c.uri),
)

active_assets = session.execute(active_assets_query).all()

If there are users with thousands of active assets, I wonder if this may explode one day.

Contributor Author

Good point. It may be out of scope for this PR; I might open a follow-up PR to handle large scale, for example with batching. As of now it is not an issue, so I will leave it as is.
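One possible shape for such batching, sketched with the standard-library `sqlite3` module and a made-up `asset_active` table. The helper name `iter_in_batches` and the batch size are assumptions for illustration, not part of this PR. Rows are pulled from the cursor in fixed-size chunks instead of loading the full result with `.all()`.

```python
import sqlite3
from collections.abc import Iterator


def iter_in_batches(cursor: sqlite3.Cursor, batch_size: int) -> Iterator[list[tuple]]:
    """Yield result rows in lists of at most ``batch_size`` rows."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            return
        yield rows


conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for the active-assets table.
conn.execute("CREATE TABLE asset_active (name TEXT, uri TEXT)")
conn.executemany(
    "INSERT INTO asset_active VALUES (?, ?)",
    [(f"asset_{i}", f"s3://bucket/{i}") for i in range(2500)],
)

cursor = conn.execute("SELECT name, uri FROM asset_active ORDER BY name")
# Each batch could be processed and discarded before the next one is fetched.
batch_sizes = [len(batch) for batch in iter_in_batches(cursor, batch_size=1000)]
```

With 2,500 rows and a batch size of 1,000, only one chunk is held in memory at a time.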

session.execute(
delete(AssetActive).where(
tuple_(AssetActive.name, AssetActive.uri).in_((a.name, a.uri) for a in assets)
def _orphan_unreferenced_assets(assets_query: CTE, *, session: Session) -> None:

Maybe we can avoid passing a CTE as an argument (which is not intuitive) by using the helper function.

Contributor Author

What do you suggest, then?
This approach had the least duplicated code. If you have any suggestions, I would be happy to hear them.

asset_reference_query is a static query that never changes. If it's referenced in two places, maybe it's worth extracting it as a helper function, again?

This way we won't be passing CTEs as method parameters.

Contributor Author

In my opinion, it is harder to track that way.

It is much simpler to see a query passed in than to jump to a different method.

I don't think it's harder to track. As it's a constant, reusable CTE, I would put it as a cached util function in the corresponding module instead of generating it in the scheduler code.

@Nataneljpwd Nataneljpwd force-pushed the feature/remove-large-in-clause-in-assets branch from d8e235d to ff0347f Compare February 26, 2026 16:08
Member

@kaxil kaxil left a comment

Thanks for pushing this — the SQL-shape change is in the right direction for #61453. I left two inline comments for follow-up before merge.

assets = select(AssetModel).where(assets_select_condition).cte()

if not AIRFLOW_V_3_2_PLUS:
    assets = self.session.scalars(select(assets)).all()
Member

For Airflow <3.2 this fallback currently uses scalars(select(assets)) where assets is a CTE built from select(AssetModel). scalars() returns only the first selected column, so this becomes a list of IDs (not AssetModel objects). That can break _activate_referenced_assets when it expects .name / .uri. Could we keep the old pre-3.2 materialization query, or join the CTE back to AssetModel before calling scalars()?

Contributor Author

Sure, I will join the CTE back to AssetModel.


asset_models = session.scalars(select(AssetModel)).all()
assert len(asset_models) == 3
asset_models = select(AssetModel).cte()
Member

Would you add an explicit regression assertion for #61453's failure mode (large tuple-IN bind expansion)? These tests now validate behavior with a CTE input, but they don't directly guard against reintroducing a huge (name, uri) IN (...) path in scheduler asset activation.

Contributor Author

How do you think this could be added? Using IN does not cause a failure; it only causes a slowdown.

The only thing I can think of is checking for the keyword 'in' in the string form of the query.

Contributor Author

Found a way to make it work with SQLAlchemy event listeners, and added the test.
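The test added in this PR is not shown in the thread. As a rough analogue of the idea, the sketch below captures every executed statement and fails if a tuple IN appears. It uses the standard-library `sqlite3.Connection.set_trace_callback` in place of SQLAlchemy's event listeners, and the helper names, regex, and tables are all assumptions made for illustration.

```python
import re
import sqlite3

# Matches a row-value IN such as "(name, uri) IN (" regardless of case or spacing.
TUPLE_IN = re.compile(r"\)\s+IN\s*\(", re.IGNORECASE)


def statements_with_tuple_in(conn, run):
    """Run ``run(conn)`` and return the executed statements that use a tuple IN."""
    captured = []
    conn.set_trace_callback(captured.append)  # called with each executed statement
    try:
        run(conn)
    finally:
        conn.set_trace_callback(None)
    return [sql for sql in captured if TUPLE_IN.search(sql)]


def select_with_join(conn):
    # Join-based lookup: the shape this PR moves to.
    conn.execute(
        "SELECT aa.name, aa.uri FROM asset_active aa "
        "JOIN asset a ON aa.name = a.name AND aa.uri = a.uri"
    ).fetchall()


def select_with_tuple_in(conn):
    # Tuple-IN lookup: the shape the regression check should flag.
    conn.execute(
        "SELECT name, uri FROM asset_active WHERE (name, uri) IN (VALUES (?, ?))",
        ("a", "s3://a"),
    ).fetchall()


conn = sqlite3.connect(":memory:")
# Hypothetical, simplified tables.
conn.executescript(
    """
    CREATE TABLE asset (name TEXT, uri TEXT);
    CREATE TABLE asset_active (name TEXT, uri TEXT);
    INSERT INTO asset VALUES ('a', 's3://a');
    INSERT INTO asset_active VALUES ('a', 's3://a');
    """
)

join_offenders = statements_with_tuple_in(conn, select_with_join)
in_offenders = statements_with_tuple_in(conn, select_with_tuple_in)
```

A regression test along these lines would assert that the list of offending statements is empty for the code path under test. With SQLAlchemy, the capture step would typically hook the engine's `before_cursor_execute` event instead.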

@Nataneljpwd Nataneljpwd requested a review from Asquator February 28, 2026 13:35
@Nataneljpwd Nataneljpwd requested a review from kaxil February 28, 2026 14:31
@Nataneljpwd
Contributor Author

Hello @kaxil, I have addressed the comments and would appreciate a review.

Labels

area:Scheduler including HA (high availability) scheduler

Development

Successfully merging this pull request may close these issues.

Avoid large tuple IN query in SchedulerJobRunner._activate_referenced_assets on PostgreSQL (performance / perceived hanging)

3 participants